Discovering data dependencies in Web content mining

Authors

  • José Carlos Cortizo
  • J. Ignacio Giráldez
Abstract

Web content mining makes it possible to use data presented in web pages to discover interesting and useful patterns. Our web mining tool, FBL (Filtered Bayesian Learning), performs a two-stage process: first it analyzes the data present in a web page, and then, using information about the data dependencies encountered, it performs the mining phase based on Bayesian learning. The Naïve Bayes classifier is based on the assumption that the attribute values are conditionally independent given the class. This makes it perform very well in some data domains but poorly when attributes are dependent. In this paper, we try to identify those dependencies using linear regression on the attribute values, and then eliminate the attributes that are a linear combination of one or two others. We have tested this system on six web domains (extracting the data by parsing the HTML), to which we added a synthetic attribute that is a linear combination of two of the original ones. The system detects those synthetic attributes perfectly, as well as some “natural” dependent attributes, obtaining a more accurate classifier.
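The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the FBL implementation: the function names, the R² threshold of 0.999, the intercept term, and the last-to-first scan order are all assumptions made for illustration. Each attribute is regressed by least squares on every single other attribute and every pair of other attributes, and it is dropped when the fit is nearly exact.

```python
from itertools import combinations

import numpy as np


def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictors plus an intercept."""
    A = np.column_stack([predictors, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    # A constant column has no variance to explain; report 0 so it is not flagged.
    return 0.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot


def dependent_attributes(X, threshold=0.999):
    """Indices of columns that are (nearly) a linear combination of one or
    two of the columns still kept. Threshold and scan order are assumptions."""
    X = np.asarray(X, dtype=float)
    kept = list(range(X.shape[1]))
    dropped = []
    # Scan from the last column to the first, so that when several columns
    # explain each other the later one is the one removed (arbitrary choice).
    for target in reversed(range(X.shape[1])):
        others = [j for j in kept if j != target]
        for size in (1, 2):
            if any(
                r_squared(X[:, target], X[:, list(combo)]) >= threshold
                for combo in combinations(others, size)
            ):
                kept.remove(target)
                dropped.append(target)
                break
    return sorted(dropped)
```

The remaining columns would then be passed to a Naïve Bayes learner. Because a relation such as c = a + b also makes a expressible from b and c, which attribute gets removed depends on the scan order; the sketch keeps earlier columns and drops later ones.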

Similar resources

Effective web log mining and online navigational pattern prediction

The web has become the world's largest repository of knowledge. Web usage mining is the process of discovering knowledge from the interactions generated by the user in the form of access logs, cookies, and user sessions data. Web Mining consists of three different categories, namely Web Content Mining, Web Structure Mining, and Web Usage Mining (is the process of discovering knowledge from the ...

A Survey on Web Mining Using Fuzzy Logic

The Internet has unlimited resources of knowledge and is widely used in many applications. Web mining plays an important role in discovering such knowledge; it is roughly divided into three categories: Web Content Mining, Web Usage Mining, and Web Structure Mining. The web consists of imprecise, incomplete and uncertain data and knowledge. Fuzzy Set Theory is often used to handle such data. Se...

Data Mining For Web Security: UserWatcher

Data mining techniques have proved to be efficient for discovering interesting and useful patterns in large amounts of data, such as Web documents. This paper investigates the use of mining techniques to secure Web access. We propose UserWatcher, a mining tool that integrates Web usage mining and Web content mining to find potential correlations between data that a user accesses and the data t...

Discovering task-oriented usage pattern for web recommendation

Web transaction data usually convey users' task-oriented behaviour patterns. Web usage mining techniques can capture such knowledge about user task patterns from usage data. With the discovered usage pattern information, it is possible to recommend more preferred content or a customized presentation to Web users according to the derived task preference. In this paper, we propose a Web r...

User Content Mining Supporting Usage Content for Web Personalization

In Web personalization, usage mining has been used in combination with standard methods to help predict user needs based on their transaction histories. Although information in the usage logs of a Web server reflects the interests of users in the site, the users are potentially not aware of all information needs that can be addressed through the Web server. User data (content) is a valuable source for...

Publication date: 2004